Abstract

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

1 Introduction 

The semantic concepts of entailment and contradiction are central to all aspects of natural language meaning (Katz, 1972; van Benthem, 2008), from the lexicon to the content of entire texts. Thus, natural language inference (NLI), which involves characterizing and using these relations in computational systems (Fyodorov et al., 2000; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009), is essential in tasks ranging from information retrieval to semantic parsing to commonsense reasoning.

NLI has been addressed using a variety of techniques, including those based on symbolic logic, knowledge bases, and neural networks. In recent years, it has become an important testing ground for approaches employing distributed word and phrase representations. Distributed representations excel at capturing relations based in similarity, and have proven effective at modeling simple dimensions of meaning like evaluative sentiment (e.g., Socher et al. 2013), but it is less clear that they can be trained to support the full range of logical and commonsense inferences required for NLI (Bowman et al., 2015; Weston et al., 2015b; Weston et al., 2015a). In a SemEval 2014 task aimed at evaluating distributed representations for NLI, the best-performing systems relied heavily on additional features and reasoning capabilities (Marelli et al., 2014a).

Our ultimate objective is to provide an empirical evaluation of learning-centered approaches to NLI, advancing the case for NLI as a tool for the evaluation of domain-general approaches to semantic representation. However, in our view, existing NLI corpora do not permit such an assessment. They are generally too small for training modern data-intensive, wide-coverage models, many contain sentences that were algorithmically generated, and they are often beset with indeterminacies of event and entity coreference that significantly impact annotation quality.

To address this, this paper introduces the Stanford Natural Language Inference (SNLI) corpus, a collection of sentence pairs labeled for entailment, contradiction, and semantic independence. At 570,152 sentence pairs, SNLI is two orders of magnitude larger than all other resources of its type. And, in contrast to many such resources, all of its sentences and labels were written by humans in a grounded, naturalistic context. In a separate validation phase, we collected four additional judgments for each label for 56,941 of the examples. Of these, 98% of cases emerge with a three-annotator consensus, and 58% see a unanimous consensus from all five annotators.

In this paper, we use this corpus to evaluate a variety of models for natural language inference, including rule-based systems, simple linear classifiers, and neural network-based models. We find that two models achieve comparable performance: a feature-rich classifier model and a neural network model centered around a Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber 1997). We further evaluate the LSTM model by taking advantage of its ready support for transfer learning, and show that it can be adapted to an existing NLI challenge task, yielding the best reported performance by a neural network model and approaching the overall state of the art.

Premise:    A man inspects the uniform of a figure in some East Asian country.
Gold label: contradiction (C C C C C)
Hypothesis: The man is sleeping

Premise:    An older and younger man smiling.
Gold label: neutral (N N E N N)
Hypothesis: Two men are smiling and laughing at the cats playing on the floor.

Premise:    A black race car starts up in front of a crowd of people.
Gold label: contradiction (C C C C C)
Hypothesis: A man is driving down a lonely road.

Premise:    A soccer game with multiple males playing.
Gold label: entailment (E E E E E)
Hypothesis: Some men are playing a sport.

Premise:    A smiling costumed woman is holding an umbrella.
Gold label: neutral (N N E C N)
Hypothesis: A happy woman in a fairy costume holds an umbrella.

Table 1: Randomly chosen examples from the development section of our new corpus, shown with both the selected gold labels and the full set of labels (abbreviated) from the individual annotators, including (in the first position) the label used by the initial author of the pair.

2 A new corpus for NLI 

To date, the primary sources of annotated NLI corpora have been the Recognizing Textual Entailment (RTE) challenge tasks.¹ These are generally high-quality, hand-labeled data sets, and they have stimulated innovative logical and statistical models of natural language reasoning, but their small size (fewer than a thousand examples each) limits their utility as a testbed for learned distributed representations. The data for the SemEval 2014 task called Sentences Involving Compositional Knowledge (SICK) is a step up in terms of size, but only to 4,500 training examples, and its partly automatic construction introduced some spurious patterns into the data (Marelli et al. 2014a, §6). The Denotation Graph entailment set (Young et al., 2014) contains millions of examples of entailments between sentences and artificially constructed short phrases, but it was labeled using fully automatic methods, and is noisy enough that it is probably suitable only as a source of supplementary training data. Outside the domain of sentence-level entailment, Levy et al. (2014) introduce a large corpus of semi-automatically annotated entailment examples between subject-verb-object relation triples, and the second release of the Paraphrase Database (Pavlick et al., 2015) includes automatically generated entailment annotations over a large corpus of pairs of words and short phrases.

¹ http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool

Existing resources suffer from a subtler issue that impacts even projects using only human-provided annotations: indeterminacies of event and entity coreference lead to insurmountable indeterminacy concerning the correct semantic label (de Marneffe et al. 2008 §4.3; Marelli et al. 2014b). For an example of the pitfalls surrounding entity coreference, consider the sentence pair A boat sank in the Pacific Ocean and A boat sank in the Atlantic Ocean. The pair could be labeled as a contradiction if one assumes that the two sentences refer to the same single event, but could also be reasonably labeled as neutral if that assumption is not made. In order to ensure that our labeling scheme assigns a single correct label to every pair, we must select one of these approaches across the board, but both choices present problems. If we opt not to assume that events are coreferent, then we will only ever find contradictions between sentences that make broad universal assertions, but if we opt to assume coreference, new counterintuitive predictions emerge. For example, Ruth Bader Ginsburg was appointed to the US Supreme Court and I had a sandwich for lunch today would unintuitively be labeled as a contradiction, rather than neutral, under this assumption.

Entity coreference presents a similar kind of indeterminacy, as in the pair A tourist visited New York and A tourist visited the city. Assuming coreference between New York and the city justifies labeling the pair as an entailment, but without that assumption the city could be taken to refer to a specific unknown city, leaving the pair neutral. This kind of indeterminacy of label can be resolved only once the questions of coreference are resolved.

With SNLI, we sought to address the issues of size, quality, and indeterminacy. To do this, we employed a crowdsourcing framework with the following crucial innovations. First, the examples were grounded in specific scenarios, and the premise and hypothesis sentences in each example were constrained to describe that scenario from the same perspective, which helps greatly in controlling event and entity coreference.² Second, the prompt gave participants the freedom to produce entirely novel sentences within the task setting, which led to richer examples than we see with the more proscribed string-editing techniques of earlier approaches, without sacrificing consistency. Third, a subset of the resulting sentences were sent to a validation task aimed at providing a highly reliable set of annotations over the same data, and at identifying areas of inferential uncertainty.

2.1 Data collection 

We used Amazon Mechanical Turk for data collection. In each individual task (each HIT), a worker was presented with premise scene descriptions from a pre-existing corpus, and asked to supply hypotheses for each of our three labels (entailment, neutral, and contradiction), forcing the data to be balanced among these classes.

The instructions that we provided to the workers are shown in Figure 1. Below the instructions were three fields for each of three requested sentences, corresponding to our entailment, neutral, and contradiction labels, a fourth field (marked optional) for reporting problems, and a link to an FAQ page. That FAQ grew over the course of data collection. It warned about disallowed techniques (e.g., reusing the same sentence for many different prompts, which we saw in a few cases), provided guidance concerning sentence length and complexity (we did not enforce a minimum length, and we allowed bare NPs as well as full sentences), and reviewed logistical issues around payment timing. About 2,500 workers contributed.

² Issues of coreference are not completely solved, but greatly mitigated. For example, with the premise sentence A dog is lying in the grass, a worker could safely assume that the dog is the most prominent thing in the photo, and very likely the only dog, and build contradicting sentences assuming reference to the same dog.

We will show you the caption for a photo. We will not show you the photo. Using only the caption and what you know about the world:

• Write one alternate caption that is definitely a true description of the photo. Example: For the caption “Two dogs are running through a field.” you could write “There are animals outdoors.”

• Write one alternate caption that might be a true description of the photo. Example: For the caption “Two dogs are running through a field.” you could write “Some puppies are running to catch a stick.”

• Write one alternate caption that is definitely a false description of the photo. Example: For the caption “Two dogs are running through a field.” you could write “The pets are sitting on a couch.” This is different from the maybe correct category because it’s impossible for the dogs to be both running and sitting.

Figure 1: The instructions used on Mechanical Turk for data collection.

For the premises, we used captions from the Flickr30k corpus (Young et al., 2014), a collection of approximately 160k captions (corresponding to about 30k images) collected in an earlier crowdsourced effort.³ The captions were not authored by the photographers who took the source images, and they tend to contain relatively literal scene descriptions that are suited to our approach, rather than those typically associated with personal photographs (as in their example: Our trip to the Olympic Peninsula). In order to ensure that the label for each sentence pair can be recovered solely based on the available text, we did not use the images at all during corpus collection.

Table 2 reports some key statistics about the collected corpus, and Figure 2 shows the distributions of sentence lengths for both our source premises and our newly collected hypotheses. We observed that while premise sentences varied considerably in length, hypothesis sentences tended to be as short as possible while still providing enough information to yield a clear judgment, clustering at around seven words. We also observed that the bulk of the sentences from both sources were syntactically complete rather than fragments, and the frequency with which the parser produces a parse rooted with an ‘S’ (sentence) node attests to this.

³ We additionally include about 4k sentence pairs from a pilot study in which the premise sentences were instead drawn from the VisualGenome corpus (under construction; visualgenome.org). These examples appear only in the training set, and have pair identifiers prefixed with vg in our corpus.

Data set sizes:
  Training pairs                   550,152
  Development pairs                 10,000
  Test pairs                        10,000
Sentence length:
  Premise mean token count            14.1
  Hypothesis mean token count          8.3
Parser output:
  Premise ‘S’-rooted parses          74.0%
  Hypothesis ‘S’-rooted parses       88.9%
  Distinct words (ignoring case)    37,026

Table 2: Key statistics for the raw sentence pairs in SNLI. Since the two halves of each pair were collected separately, we report some statistics for both.

2.2 Data validation 

In order to measure the quality of our corpus, and in order to construct maximally useful testing and development sets, we performed an additional round of validation for about 10% of our data. This validation phase followed the same basic form as the Mechanical Turk labeling task used to label the SICK entailment data: we presented workers with pairs of sentences in batches of five, and asked them to choose a single label for each pair. We supplied each pair to four annotators, yielding five labels per pair including the label used by the original author. The instructions were similar to the instructions for initial data collection shown in Figure 1, and linked to a similar FAQ. Though we initially used a very restrictive qualification (based on past approval rate) to select workers for the validation task, we nonetheless discovered (and deleted) some instances of random guessing in an early batch of work, and subsequently instituted a fully closed qualification restricted to about 30 trusted workers.

Figure 2: The distribution of sentence length. (Legend: Premise, Hypothesis.)

For each pair that we validated, we assigned a gold label. If any one of the three labels was chosen by at least three of the five annotators, it was chosen as the gold label. If there was no such consensus, which occurred in about 2% of cases, we assigned the placeholder label ‘-’. While these unlabeled examples are included in the corpus distribution, they are unlikely to be helpful for the standard NLI classification task, and we do not include them in either training or evaluation in the experiments that we discuss in this paper.
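The gold-label rule described above (a label needs at least three of the five votes) can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `gold_label` and the constant `NO_CONSENSUS` are hypothetical, and the placeholder string is an assumption made for the example.

```python
from collections import Counter

# Hypothetical placeholder for pairs without a three-vote consensus (an assumption).
NO_CONSENSUS = "-"


def gold_label(labels, min_votes=3):
    """Return the label chosen by at least `min_votes` annotators, else the placeholder.

    `labels` holds the author's label plus the validators' labels
    (five labels in total in the setup described in the text).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else NO_CONSENSUS


print(gold_label(["N", "N", "E", "C", "N"]))  # three votes for N
print(gold_label(["N", "N", "E", "E", "C"]))  # no label reaches three votes
```

With five votes over three classes, at most one label can reach three votes, so the rule never produces a tie.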

The results of this validation process are summarized in Table 3. Nearly all of the examples received a majority label, indicating broad consensus about the nature of the data and categories. The gold-labeled examples are very nearly evenly distributed across the three labels. The Fleiss κ scores (computed over every example with a full five annotations) are likely to be conservative given our large and unevenly distributed pool of annotators, but they still provide insights about the levels of disagreement across the three semantic classes. This disagreement likely reflects not just the limitations of large crowdsourcing efforts but also the uncertainty inherent in naturalistic NLI. Regardless, the overall rate of agreement is extremely high, suggesting that the corpus is sufficiently high quality to pose a challenging but realistic machine learning task.
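For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard overall Fleiss κ computation, assuming every item carries the same number of labels (five in the validated data). It is not the authors' script, the function name `fleiss_kappa` is hypothetical, and it does not compute the per-class scores.

```python
def fleiss_kappa(ratings, categories):
    """Overall Fleiss' kappa for items that all have the same number of ratings.

    `ratings` is a list of per-item label lists, e.g. [["E", "E", "N", "E", "E"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of raters who assigned category j to item i
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    expected = sum(p * p for p in shares)

    # Undefined (division by zero) if every rating falls in a single category.
    return (observed - expected) / (1 - expected)
```

Perfect agreement on a mix of classes gives κ = 1, while systematic disagreement drives the score below zero.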

2.3 The distributed corpus 

Table 1 shows a set of randomly chosen validated examples from the development set with their labels. Qualitatively, we find the data that we collected draws fairly extensively on commonsense knowledge, and that hypothesis and premise sentences often differ structurally in significant ways, suggesting that there is room for improvement beyond superficial word alignment models. We also find the sentences that we collected to be largely fluent, correctly spelled English, with a mix of full sentences and caption-style noun phrase fragments, though punctuation and capitalization are often omitted.

General:
  Validated pairs                       56,951
  Pairs w/ unanimous gold label          58.3%
Individual annotator label agreement:
  Individual label = gold label          89.0%
  Individual label = author’s label      85.8%
Gold label/author’s label agreement:
  Gold label = author’s label            91.2%
  Gold label ≠ author’s label             6.8%
  No gold label (no 3 labels match)       2.0%
Fleiss κ:
  contradiction                           0.77
  entailment                              0.72
  neutral                                 0.60
  Overall                                 0.70

Table 3: Statistics for the validated pairs. The author’s label is the label used by the worker who wrote the hypothesis to create the sentence pair. A gold label reflects a consensus of three votes from among the author and the four annotators.

The corpus is available under a Creative Commons Attribution-Share Alike license, the same license used for the Flickr30k source captions. It can be downloaded at:

nlp.stanford.edu/projects/snli/ 

Partition. We distribute the corpus with a prespecified train/test/development split. The test and development sets contain 10k examples each. Each original ImageFlickr caption occurs in only one of the three sets, and all of the examples in the test and development sets have been validated.
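The property that each caption occurs in only one of the three sets can be obtained by splitting at the level of premises rather than individual pairs. The text does not say how the split was produced, so the sketch below is only one possible way to satisfy that property; the function name `split_by_premise`, the dictionary keys, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def split_by_premise(pairs, n_dev_premises, n_test_premises, seed=0):
    """Split pairs into train/dev/test so that no premise is shared across sets.

    `pairs` is a list of dicts with at least a "premise" key (an assumed format).
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["premise"]].append(pair)

    # Shuffle the distinct premises, then assign whole groups to each set.
    premises = sorted(groups)
    random.Random(seed).shuffle(premises)
    dev = premises[:n_dev_premises]
    test = premises[n_dev_premises:n_dev_premises + n_test_premises]
    train = premises[n_dev_premises + n_test_premises:]

    def collect(selected):
        return [pair for premise in selected for pair in groups[premise]]

    return collect(train), collect(dev), collect(test)
```

Because whole premise groups move together, a model evaluated on the development or test set never sees a premise that also appeared in training.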

Parses. The distributed corpus includes parses produced by the Stanford PCFG Parser 3.5.2 (Klein and Manning, 2003), trained on the standard training set as well as on the Brown Corpus (Francis and Kucera 1979), which we found to improve the parse quality of the descriptive sentences and noun phrases found in the descriptions.

3 Our data as a platform for evaluation 

The most immediate application for our corpus is in developing models for the task of NLI. In particular, since it is dramatically larger than any existing corpus of comparable quality, we expect it to be suitable for training parameter-rich models like neural networks, which have not previously been competitive at this task. Our ability to evaluate standard classifier-based NLI models, however, was limited to those which were designed to scale to SNLI’s size without modification, so a more complete comparison of approaches will have to wait for future work. In this section, we explore the performance of three classes of models which could scale readily: (i) models from a well-known NLI system, the Excitement Open Platform; (ii) variants of a strong but simple feature-based classifier model, which makes use of both unlexicalized and lexicalized features; and (iii) distributed representation models, including a baseline model and neural network sequence models.

System                 SNLI   SICK   RTE-3
Edit Distance Based    71.9   65.4   61.9
Classifier Based       72.2   71.4   61.5
+ Lexical Resources    75.0   78.8   63.6

Table 4: 2-class test accuracy for two simple baseline systems included in the Excitement Open Platform, as well as SICK and RTE results for a model making use of more sophisticated lexical resources.

3.1 Excitement Open Platform models 

The first class of models is from the Excitement Open Platform (EOP; Pado et al. 2014; Magnini et al. 2014), an open source platform for RTE research. EOP is a tool for quickly developing NLI systems while sharing components such as common lexical resources and evaluation sets. We evaluate on two algorithms included in the distribution: a simple edit-distance based algorithm and a classifier-based algorithm, the latter both in a bare form and augmented with EOP’s full suite of lexical resources.

Our initial goal was to better understand the difficulty of the task of classifying SNLI corpus inferences, rather than necessarily the performance of a state-of-the-art RTE system. We approached this by running the same system on several data sets: our own test set, the SICK test data, and the standard RTE-3 test set (Giampiccolo et al., 2007). We report results in Table 4. Each of the models was separately trained on the training set of each corpus. All models are evaluated only on 2-class entailment. To convert 3-class problems like SICK and SNLI to this setting, all instances of contradiction and unknown are converted to nonentailment. This yields a most-frequent-class baseline accuracy of 66% on SNLI, and 71% on SICK. This is intended primarily to demonstrate the difficulty of the task, rather than necessarily the performance of a state-of-the-art RTE system. The edit distance algorithm tunes the weight of the three case-insensitive edit distance operations on the training set, after removing stop words. In addition to the base classifier-based system distributed with the platform, we train a variant which includes information from WordNet (Miller, 1995) and VerbOcean (Chklovski and Pantel, 2004), and makes use of features based on tree patterns and dependency tree skeletons (Wang and Neumann, 2007).
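The 3-class to 2-class conversion and the resulting most-frequent-class baseline can be illustrated with a short sketch. The label strings and function names below are assumptions made for the example, not part of the EOP tooling.

```python
from collections import Counter


def to_two_class(label):
    """Collapse every label other than entailment into a single nonentailment class."""
    return "entailment" if label == "entailment" else "nonentailment"


def most_frequent_class_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# A toy, perfectly balanced 3-class label set (illustrative data only).
three_class = ["entailment", "neutral", "contradiction"] * 10
two_class = [to_two_class(label) for label in three_class]
baseline = most_frequent_class_accuracy(two_class)
```

On a balanced 3-class set, two thirds of the examples become nonentailment, which is consistent with the 66% baseline reported for SNLI.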

3.2 Lexicalized Classifier 

Unlike the RTE datasets, SNLI’s size supports approaches which make use of rich lexicalized features. We evaluate a simple lexicalized classifier to explore the ability of non-specialized models to exploit these features in lieu of more involved language understanding. Our classifier implements 6 feature types; 3 unlexicalized and 3 lexicalized:

1. The BLEU score of the hypothesis with respect to the premise, using an n-gram length between 1 and 4.

2. The length difference between the hypothesis and the premise, as a real-valued feature.

3. The overlap between words in the premise and hypothesis, both as an absolute count and a percentage of possible overlap, and both over all words and over just nouns, verbs, adjectives, and adverbs.

4. An indicator for every unigram and bigram in the hypothesis.

5. Cross-unigrams: for every pair of words across the premise and hypothesis which share a POS tag, an indicator feature over the two words.

6. Cross-bigrams: for every pair of bigrams across the premise and hypothesis which share a POS tag on the second word, an indicator feature over the two bigrams.
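The three lexicalized feature types in the list above (Features 4 to 6) can be sketched as follows. This is an illustrative reading of the descriptions, not the authors' implementation: sentences are assumed to arrive as lists of (word, POS tag) tuples from some tagger, lowercasing is an assumption, and the feature-name tuples are hypothetical.

```python
def hypothesis_ngrams(hyp):
    """Feature 4: an indicator for every unigram and bigram in the hypothesis."""
    words = [w.lower() for w, _ in hyp]
    feats = {("uni", w) for w in words}
    feats |= {("bi", a, b) for a, b in zip(words, words[1:])}
    return feats


def cross_unigrams(prem, hyp):
    """Feature 5: word pairs across premise and hypothesis that share a POS tag."""
    return {
        ("xuni", pw.lower(), hw.lower())
        for pw, pt in prem
        for hw, ht in hyp
        if pt == ht
    }


def cross_bigrams(prem, hyp):
    """Feature 6: bigram pairs whose second words share a POS tag."""
    def bigrams(sent):
        # Each entry: (first word, second word, POS tag of the second word)
        return [(a[0].lower(), b[0].lower(), b[1]) for a, b in zip(sent, sent[1:])]

    return {
        ("xbi", (p1, p2), (h1, h2))
        for p1, p2, pt in bigrams(prem)
        for h1, h2, ht in bigrams(hyp)
        if pt == ht
    }


# Hand-tagged toy example (the tags are assumptions, not parser output).
premise = [("Two", "CD"), ("dogs", "NNS"), ("run", "VBP")]
hypothesis = [("Some", "DT"), ("animals", "NNS"), ("move", "VBP")]
features = (
    hypothesis_ngrams(hypothesis)
    | cross_unigrams(premise, hypothesis)
    | cross_bigrams(premise, hypothesis)
)
```

Each element of `features` would act as a sparse binary indicator in a linear classifier, which is why the cross-bigram features become very sparse on a large corpus.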

We report results in Table 5, along with ablation studies for removing the cross-bigram features (leaving only the cross-unigram feature) and for removing all lexicalized features. On our large corpus in particular, there is a substantial jump in accuracy from using lexicalized features, and another from using the very sparse cross-bigram features. The latter result suggests that there is value in letting the classifier automatically learn to recognize structures like explicit negations and adjective modification. A similar result was shown in Wang and Manning (2012) for bigram features in sentiment analysis.

System           SNLI Train   SNLI Test   SICK Train   SICK Test
Lexicalized         99.7         78.2        90.4         77.8
Unigrams Only       93.1         71.6        88.1         77.0
Unlexicalized       49.4         50.4        69.9         69.6

Table 5: 3-class accuracy, training on either our data or SICK, including models lacking cross-bigram features (Feature 6), and lacking all lexical features (Features 4-6). We report results both on the test set and the training set to judge overfitting.

It is surprising that the classifier performs as well as it does without any notion of alignment or tree transformations. Although we expect that richer models would perform better, the results suggest that given enough data, cross bigrams with the noisy part-of-speech overlap constraint can produce an effective model.

3.3 Sentence embeddings and NLI 

SNLI is suitably large and diverse to make it possible to train neural network models that produce distributed representations of sentence meaning. In this section, we compare the performance of three such models on the corpus. To focus specifically on the strengths of these models at producing informative sentence representations, we use sentence embedding as an intermediate step in the NLI classification task: each model must produce a vector representation of each of the two sentences without using any context from the other sentence, and the two resulting vectors are then passed to a neural network classifier which predicts the label for the pair. This choice allows us to focus on existing models for sentence embedding, and it allows us to evaluate the ability of those models to learn useful representations of meaning (which may be independently useful for subsequent tasks), at the cost of excluding from consideration possible strong neural models for NLI that directly compare the two inputs at the word or phrase level.

[Figure 3 diagram: two 100d sentence inputs feed a stack of 200d tanh layers, topped by a 3-way softmax classifier.]

Figure 3: The neural network classification architecture: for each sentence embedding model evaluated in Tables 6 and 7, two identical copies of the model are run with the two sentences as input, and their outputs are used as the two 100d inputs shown here.

Our neural network classifier, depicted in Figure 3 (and based on a one-layer model in Bowman et al. 2015), is simply a stack of three 200d tanh layers, with the bottom layer taking the concatenated sentence representations as input and the top layer feeding a softmax classifier, all trained jointly with the sentence embedding model itself.

We test three sentence embedding models, each set to use 100d phrase and sentence embeddings. Our baseline sentence embedding model simply sums the embeddings of the words in each sentence. In addition, we experiment with two simple sequence embedding models: a plain RNN and an LSTM RNN (Hochreiter and Schmidhuber, 1997).

The word embeddings for all of the models are initialized with the 300d reference GloVe vectors (840B token version, Pennington et al. 2014) and fine-tuned as part of training. In addition, all of the models use an additional tanh neural network layer to map these 300d embeddings into the lower-dimensional phrase and sentence embedding space. All of the models are randomly initialized using standard techniques and trained using AdaDelta (Zeiler, 2012) minibatch SGD until performance on the development set stops improving. We applied L2 regularization to all models, manually tuning the strength coefficient λ for each, and additionally applied dropout (Srivastava et al., 2014) to the inputs and outputs of the sentence embedding models (though not to their internal connections) with a fixed dropout rate. All models were implemented in a common framework for this paper, and the implementations will be made available at publication time.

Sentence model        Train   Test
100d Sum of words     79.3    75.3
100d RNN              73.1    72.2
100d LSTM RNN         84.8    77.6

Table 6: Accuracy in 3-class classification on our training and test sets for each model.
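The data flow of the architecture described in this section (Figure 3), with the sum-of-words baseline as the sentence model, can be sketched as a forward pass in plain Python. This is a rough illustration under stated assumptions, not the authors' implementation: weights are random rather than trained, the initialization scheme is arbitrary, training, dropout, and L2 regularization are omitted, applying the tanh projection to each word before summing is one reading of the text, and all function and parameter names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation=None):
    """Apply one dense layer to vector x, with an optional elementwise activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(word_dim=300, sent_dim=100, hidden_dim=200, n_classes=3, seed=0):
    rng = random.Random(seed)
    return {
        "project": init_layer(word_dim, sent_dim, rng),  # 300d -> 100d tanh layer
        "hidden": [
            init_layer(2 * sent_dim, hidden_dim, rng),   # takes the concatenated pair
            init_layer(hidden_dim, hidden_dim, rng),
            init_layer(hidden_dim, hidden_dim, rng),
        ],
        "output": init_layer(hidden_dim, n_classes, rng),
    }


def embed_sum_of_words(word_vectors, params):
    """Project each word vector with a tanh layer, then sum over the sentence."""
    projected = [dense(v, params["project"], math.tanh) for v in word_vectors]
    return [sum(column) for column in zip(*projected)]


def classify(premise_vectors, hypothesis_vectors, params):
    """Return a probability for each of the three labels."""
    # Each sentence is embedded independently, then the two vectors are concatenated.
    x = embed_sum_of_words(premise_vectors, params) + embed_sum_of_words(hypothesis_vectors, params)
    for layer in params["hidden"]:
        x = dense(x, layer, math.tanh)
    return softmax(dense(x, params["output"]))
```

Swapping `embed_sum_of_words` for a recurrent encoder would leave the classifier stack unchanged, which is the point of treating sentence embedding as an intermediate step. The sketch assumes non-empty sentences.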

The results are shown in Table 6. The sum of words model performed slightly worse than the fundamentally similar lexicalized classifier: while the sum of words model can use pretrained word embeddings to better handle rare words, it lacks even the rudimentary sensitivity to word order that the lexicalized model’s bigram features provide. Of the two RNN models, the LSTM’s more robust ability to learn long-term dependencies serves it well, giving it a substantial advantage over the plain RNN, and resulting in performance that is essentially equivalent to the lexicalized classifier on the test set (LSTM performance near the stopping iteration varies by up to 0.5% between evaluation steps). While the lexicalized model fits the training set almost perfectly, the gap between train and test set accuracy is relatively small for all three neural network models, suggesting that research into significantly higher capacity versions of these models would be productive.

3.4 Analysis and discussion 

Figure 4 shows a learning curve for the LSTM and the lexicalized and unlexicalized feature-based models. It shows that the large size of the corpus is crucial to both the LSTM and the lexicalized model, and suggests that additional data would yield still better performance for both. In addition, though the LSTM and the lexicalized model show similar performance when trained on the current full corpus, the somewhat steeper slope for the LSTM hints that its ability to learn arbitrarily structured representations of sentence meaning may give it an advantage over the more constrained lexicalized model on still larger datasets.

Figure 4: A learning curve showing how the baseline classifiers and the LSTM perform when trained to convergence on varied amounts of training data. The y-axis starts near a random-chance accuracy of 33%. The minibatch size of 64 that we used to tune the LSTM sets a lower bound on data for that model.

We were struck by the speed with which the lexicalized classifier outperforms its unlexicalized counterpart. With only 100 training examples, the cross-bigram classifier is already performing better. Empirically, we find that the top weighted features for the classifier trained on 100 examples tend to be high precision entailments; e.g., playing → outside (most scenes are outdoors), a banana → person eating. If relatively few spurious entailments get high weight (as appears to be the case), then it makes sense that, when these do fire, they boost accuracy in identifying entailments.

There are revealing patterns in the errors common to all the models considered here. Despite the large size of the training corpus and the distributional information captured by GloVe initialization, many lexical relationships are still misanalyzed, leading to incorrect predictions of independent, even for pairs that are common in the training corpus like beach/surf and sprinter/runner. Semantic mistakes at the phrasal level (e.g., predicting contradiction for A male is placing an order in a deli/A man buying a sandwich at a deli) indicate that additional attention to compositional semantics would pay off. However, many of the persistent problems run deeper, to inferences that depend on world knowledge and context-specific inferences, as in the entailment pair A race car driver leaps from a burning car/A race car driver escaping danger, for which both the lexicalized classifier and the LSTM predict neutral. In other cases, the models’ attempts to shortcut this kind of inference through lexical cues can lead them astray. Some of these examples have qualities reminiscent of Winograd schemas (Winograd, 1972; Levesque, 2013). For example, all the models wrongly predict entailment for A young girl throws sand toward the ocean/A girl can’t stand the ocean, presumably because of distributional associations between throws and can’t stand.

Analysis of the models’ predictions also yields
insights into the extent to which they grapple with
event and entity coreference. For the most part, the
original image prompts contained a focal element
that the caption writer identified with a syntactic
subject, following information structuring
conventions associating subjects and topics in English
(Ward and Birner, 2004). Our annotators generally
followed suit, writing sentences that, while
structurally diverse, share topic/focus (theme/rheme)
structure with their premises. This promotes a
coherent, situation-specific construal of each
sentence pair. This is information that our models
can easily take advantage of, but it can lead them
astray. For instance, all of them stumble with the
amusingly simple case A woman prepares ingredients
for a bowl of soup/A soup bowl prepares a
woman, in which prior expectations about parallelism
are not met. Another headline example of
this type is A man wearing padded arm protection
is being bitten by a German shepherd dog/A
man bit a dog, which all the models wrongly
diagnose as entailment, though the sentences report
two very different stories. A model with access
to explicit information about syntactic or semantic
structure should perform better on cases like these.
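
To make the soup-bowl case concrete, the sketch below computes a purely lexical, order-insensitive signal: the fraction of distinct hypothesis words that also occur in the premise. This is not how any of the models in this paper compute their predictions; the function `hypothesis_coverage` is a hypothetical helper meant only to show that word identity alone cannot tell the two sentences apart.

```python
def hypothesis_coverage(premise, hypothesis):
    """Fraction of distinct hypothesis words that also occur in the premise."""
    p_words = set(premise.lower().split())
    h_words = set(hypothesis.lower().split())
    return len(h_words & p_words) / len(h_words)


premise = "A woman prepares ingredients for a bowl of soup"
hypothesis = "A soup bowl prepares a woman"
print(hypothesis_coverage(premise, hypothesis))  # 1.0
```

Every word of the hypothesis appears in the premise, so any evidence that ignores who does what to whom looks maximally favorable to entailment here. Distinguishing the pair requires information about argument structure.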

4 Transfer learning with SICK 

To the extent that successfully training a neural
network model like our LSTM on SNLI forces that
model to encode broadly accurate representations
of English scene descriptions and to build an
entailment classifier over those relations, we should
expect it to be readily possible to adapt the trained
model for use on other NLI tasks. In this section,
we evaluate on the SICK entailment task using a
simple transfer learning method (Pratt et al., 1991)
and achieve competitive results.

Training sets                   Train   Test
Our data only                    42.0   46.7
SICK only                       100.0   71.3
Our data and SICK (transfer)     99.9   80.8

Table 7: LSTM 3-class accuracy on the SICK
train and test sets under three training regimes.

To perform transfer, we take the parameters of
the LSTM RNN model trained on SNLI and use
them to initialize a new model, which is trained
from that point only on the training portion of
SICK. The only newly initialized parameters are
softmax layer parameters and the embeddings for
words that appear in SICK, but not in SNLI (which
are populated with GloVe embeddings as above).
We use the same model hyperparameters that were
used to train the original model, with the exception
of the L2 regularization strength, which is
re-tuned. We additionally transfer the accumulators
that are used by AdaDelta to set the learning
rates. This lowers the starting learning rates,
and is intended to ensure that the model does not
learn too quickly in its first few epochs after transfer
and destroy the knowledge accumulated in the
pre-transfer phase of training.
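
The initialization step described above can be sketched roughly as follows. This is a minimal illustration and not the code used for the experiments: the function `init_transfer`, the dictionary layout of the parameters, the small uniform initialization of the softmax layer, and the choice to discard the accumulators of the re-initialized softmax parameters are all assumptions made for the example.

```python
import copy
import random


def init_transfer(snli_params, snli_accumulators, sick_vocab, glove, seed=0):
    """Build starting parameters and AdaDelta state for training on SICK.

    All names and the data layout are hypothetical. Parameters are plain
    dicts of lists; accumulators are keyed by parameter name.
    """
    rng = random.Random(seed)
    params = copy.deepcopy(snli_params)              # reuse trained parameters
    accumulators = copy.deepcopy(snli_accumulators)  # reuse AdaDelta accumulators

    # The softmax layer is newly initialized (init scheme is an assumption).
    params["softmax_W"] = [[rng.uniform(-0.01, 0.01) for _ in row]
                           for row in snli_params["softmax_W"]]
    params["softmax_b"] = [0.0 for _ in snli_params["softmax_b"]]
    # Assumption: fresh parameters start without transferred accumulators.
    accumulators.pop("softmax_W", None)
    accumulators.pop("softmax_b", None)

    # Words seen in SICK but not in SNLI get their GloVe vectors.
    for word in sick_vocab:
        if word not in params["embeddings"] and word in glove:
            params["embeddings"][word] = list(glove[word])
    return params, accumulators


snli_params = {
    "embeddings": {"dog": [0.1, 0.2], "man": [0.3, 0.4]},
    "lstm_W": [[0.5, 0.6], [0.7, 0.8]],
    "softmax_W": [[0.9, 1.0], [1.1, 1.2], [1.3, 1.4]],
    "softmax_b": [0.1, 0.2, 0.3],
}
snli_accumulators = {
    "lstm_W": [[0.01, 0.02], [0.03, 0.04]],
    "softmax_W": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
}
glove = {"dog": [9.0, 9.0], "violin": [0.2, 0.1]}

params, accumulators = init_transfer(
    snli_params, snli_accumulators, {"dog", "violin"}, glove)
print(sorted(params["embeddings"]))  # ['dog', 'man', 'violin']
```

In this toy run the embedding for "dog" keeps its SNLI-trained value, while "violin", which is absent from the SNLI vocabulary, is filled in from the GloVe table.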

The results are shown in Table 7. Training
on SICK alone yields poor performance, and the
model trained on SNLI fails when tested on SICK
data, labeling more neutral examples as contradictions
than it labels correctly, possibly as a result of subtle
differences in how the labeling task was presented.
In contrast, transferring SNLI representations to
SICK yields the best performance yet reported for
an unaugmented neural network model, surpasses
the available EOP models, and approaches both
the overall state of the art at 84.6% (Lai and
Hockenmaier, 2014) and the 84% level of interannotator
agreement, which likely represents an approximate
performance ceiling. This suggests that the
introduction of a large high-quality corpus makes
it possible to train representation-learning models
for sentence meaning that are competitive with the
best hand-engineered models on inference tasks.
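
For reference, the gaps implied by these figures can be worked out directly from the test accuracies in Table 7 and the two reference points quoted above. The variable names are illustrative, and the relative error reduction is a derived quantity that is not reported in the text.

```python
# Test accuracies (%) from Table 7 and the reference points quoted in the text.
sick_only, transfer = 71.3, 80.8
state_of_art, agreement = 84.6, 84.0

gain = round(transfer - sick_only, 1)            # absolute gain from transfer
gap_sota = round(state_of_art - transfer, 1)     # distance to state of the art
gap_agreement = round(agreement - transfer, 1)   # distance to agreement level
# Share of the SICK-only error that transfer removes (derived, not reported).
rel_error_reduction = round(100 * (transfer - sick_only) / (100 - sick_only), 1)

print(gain, gap_sota, gap_agreement, rel_error_reduction)  # 9.5 3.8 3.2 33.1
```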

We attempted to apply this same transfer evaluation
technique to the RTE-3 challenge, but found
that the small training set (800 examples) did not
allow the model to adapt to the unfamiliar genre of
text used in that corpus, such that no training
configuration yielded competitive performance. Further
research on effective transfer learning on
small data sets with neural models might facilitate
improvements here.


5 Conclusion 

Natural languages are powerful vehicles for reasoning,
and nearly all questions about meaningfulness
in language can be reduced to questions of
entailment and contradiction in context. This suggests
that NLI is an ideal testing ground for theories
of semantic representation, and that training
for NLI tasks can provide rich domain-general
semantic representations. To date, however, it has
not been possible to fully realize this potential due
to the limited nature of existing NLI resources.
This paper sought to remedy this with a new, large-scale,
naturalistic corpus of sentence pairs labeled
for entailment, contradiction, and independence.
We used this corpus to evaluate a range of models,
and found that both simple lexicalized models and
neural network models perform well, and that the
representations learned by a neural network model
on our corpus can be used to dramatically improve
performance on a standard challenge dataset. We
hope that SNLI presents valuable training data and
a challenging testbed for the continued application
of machine learning to semantic representation.
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